Bechdel + data visualization

Programming Activity Key

This activity is meant to provide an interactive environment for students to implement what they have learned from the videos. Question types include the following:

Text Box: A space for students to write down answers to open ended questions. These questions are followed by a Solutions tab to reveal the answer.

Your Turn: These coding questions ask students to explore / edit / manipulate existing to code. These questions are followed with an Expected Solutions box so students can get immediate feedback

Demo: These coding questions have the solutions within them and allow the students to run the code and make connections.

Getting started

#| echo: false
#| context: setup

download.file(
  "https://raw.githubusercontent.com/ElijahMeyer3/Coursera/main/data/bechdel.csv",
  "bechdel.csv"
)
#| echo: false
bechdel <- read.csv("bechdel.csv")

In this mini analysis we work with the data used in the FiveThirtyEight story titled “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”.

This analysis is about the Bechdel test, a measure of the representation of women in fiction.

Packages

We’ll use: tidyverse for majority of the analysis and scales for pretty plot labels later on. These are ready to use for you in this activity!

#| label: load-packages
#| message: false
#| warning: false
library(tidyverse)
library(scales)

Data

The data are stored as a CSV (comma separated values) file in the data folder of your repository. Let’s read it from there and save it as an object called bechdel.


    
Note

This a modified version of the bechdel dataset from the previous application exercise. It’s been modified to include some new variables derived from existing variables as well as to limit the scope of the data to movies released between 1990 and 2013.

Get to know the data

We can use the glimpse function to get an overview (or “glimpse”) of the data. Write the following code below to accomplish this task.


    

With your output, confirm that:

– There are 1615 rows

– There are 7 variables (columns) in the dataset

We can use slice to look at rows of our data. Run the following code. Change the 5 to another number to print that many rows!


    

What does each observation (row) in the data set represent?




Each observation represents a movie.

Variables of Interest

The variables we’ll focus on are the following:

  • budget_2013: Budget in 2013 inflation adjusted dollars.
  • gross_2013: Gross (US and international combined) in 2013 inflation adjusted dollars.
  • roi: Return on investment, calculated as the ratio of the gross to budget.
  • clean_test: Bechdel test result:
    • ok = passes test
    • dubious
    • men = women only talk about men
    • notalk = women don’t talk to each other
    • nowomen = fewer than two women
  • binary: Bechdel Test PASS vs FAIL binary

We will also use the year of release in data prep and title of movie to take a deeper look at some outliers.

There are a few other variables in the dataset, but we won’t be using them in this analysis.

Visualizing data with ggplot2

ggplot2 is the package and ggplot() is the function in this package that is used to create a plot. Interact with the code below by either running the code given, or adding code to achieve the expected solution when asked within the code chunk!

  • ggplot() creates the initial base coordinate system, and we will add layers to that base. We first specify the data set we will use with data = bechdel.

    
  • The mapping argument is paired with an aesthetic (aes()), which tells us how the variables in our data set should be mapped to the visual properties of the graph.

    

As we previously mentioned, we often omit the names of the first two arguments in R functions. So you’ll often see this written as:


    

Note that the result is exactly the same.

  • The geom_xx function specifies the type of plot we want to use to represent the data. In the code below, we want to use geom_point, which creates a plot where each observation is represented by a point.

    
#| echo: false
library(ggplot2)
ggplot(bechdel, 
       aes(x = budget_2013, y = gross_2013)) +
  geom_point()

Note that this results in a warning as well.

This warning represents the number of observations that were removed because there were missing data!

Gross revenue vs. budget

Step 1 - Your turn

The following code changes the color of all points to coral. Explore different colors by changing “coral” to different colors!

Tip

See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for many color options you can use by name in R or use the hex code for a color of your choice.


    

Step 2 - Your turn

Add labels for the title and x and y axes using labs. Do this by modifying the existing code below.


    
#| label: plot-labels

ggplot(bechdel, 
       aes(x = budget_2013, y = gross_2013))+
  geom_point(color = "deepskyblue3") + 
  labs(
    x = "Budget (in 2013 $)", 
    y = "Gross revenue (in 2013 $)", 
    title = "Gross revenue vs. budget"
    )

Step 3 - Your turn

An aesthetic is a visual property of one of the objects in your plot. Commonly used aesthetic options are:

  • color
  • fill
  • shape
  • size
  • alpha (transparency)

Modify the plot below, so the color of the points is based on the variable binary, and make the size of your points based on roi.


    
#| label: plot-aes-color-solution

ggplot(bechdel, 
       aes(x = budget_2013, y = gross_2013,
           color = binary, size = roi)) +
  geom_point() + 
  labs(
    x = "Budget (in 2013 $)", 
    y = "Gross revenue (in 2013 $)", 
    title = "Gross revenue vs. budget, by Bechdel test result"
    )

Step 4 - Your turn

Expand on your plot from the previous step to make the transparency (alpha) of the points 0.5.


    

Step 5 - Your turn

Expand on your plot from the previous step by using facet_wrap to display the association between budget and gross for different values of clean_test.


    

Step 6 - Demo

Below is a demonstration on how to improve your plot from the previous step by making the x and y scales more legible.

Tip

This is done through the use of the scales package, specifically the scale_x_continuous() and scale_y_continuous() functions.


    

Step 7 - Your turn

Expand on your plot from the previous step by using facet_grid to display the association between budget and gross for different combinations of clean_test and binary. Comment on whether this was a useful update.


    

Is this type of facet useful? Why or why not?




This was not a useful update as one of the levels of clean_test maps directly to one of the levels of binary.

Assessment Questions

1. The gg in the name of the package ggplot2 stands for …

Options

  1. Grammar of Graphics

  2. Grammar of Graphing

  3. Good Grammar

  4. Generate Graphics

2. Please answer each true or false question below as it relates to the functionality.

  1. The <- assigns the value of the result from the operation on the right hand side to the object on the left hand side

  2. geom_jitter and geom_point will produce the same scatter plot.

  3. aes can be nested within ggplot or specific geom functions (ex. geom_line)

3. Connect the set of variables with a geom that will create an appropriate visualization.

  1. Quantitative X & Quantitative Y

  2. Categorical X & Quantitative Y

  3. Categorical X

  1. geom_boxplot()

  2. geom_point()

  3. geom_bar()

The next two questions will use the mtcars data set in R. A key to the data set can be found below:

Variable Description
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (1000 lbs)
qsec 1/4 mile time
vs Engine (0 = V-shaped, 1 = straight)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears

4. Code style: Below is code that does not follow the Tidyverse style guide. Pick the correction option that changes the code to adhere to the style guide.

ggplot(data=mpg,mapping=aes(x=drv,fill=class))+geom_bar() +scale_fill_viridis_d()

Options

ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar() + 
  scale_fill_viridis_d()
ggplot(data=mpg, aes(x = drv, fill = class)) + 
  geom_bar() + scale_fill_viridis_d()
ggplot(mpg, aes(x = drv, fill = class))+ 
geom_bar()+ 
scale_fill_viridis_d()

5. Reference the help file of the mtcars data set. Then, choose the selection that fill in the following ... to make a scatterplot of cars’ miles per gallon on the x-axis and their `weight below.

ggplot(mtcars) + 
  aes(x = ... , y = ...) + 
  geom_...()

Options

ggplot(mtcars) + 
  aes(x = mpg , y = wt) + 
  geom_point()
ggplot(mtcars) + 
  aes(x = wt , y = cyl) + 
  geom_point()
ggplot(mtcars) + 
  aes(x = mpg, y = wt) + 
  geom_scatter()
ggplot(mtcars) + 
  aes(x = wt , y = cyl) + 
  geom_scatter()